Reducing query time of rainfall data using {duckdb} and {arrow} in R

Details

In this demonstration, we use gridded rainfall data from the National Oceanic and Atmospheric Administration (NOAA) to compare query run times between two storage back ends: .parquet files on Amazon S3 and a local .duckdb file, accessed through the {duckdb} and {arrow} R packages.

This document demonstrates experience working with database structures and spatial analysis, in support of an application to the Code With Us opportunity: an R Shiny interface for the {climr} package (downscaled monthly climate data).

Procedure

The data is stored in two ways:

  • In Amazon AWS S3 buckets as .parquet files, partitioned by year.
    This is where the data is originally stored.

  • A local .duckdb file that contains the same data in a single file.
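The local .duckdb file can be built directly from the partitioned .parquet files using DuckDB's read_parquet function. A minimal sketch; the file path, glob pattern, and table name here are illustrative assumptions, not the ones used in this project:

# Build a local .duckdb file from year-partitioned parquet files (illustrative)
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "precip.duckdb")
DBI::dbExecute(con, "
  CREATE TABLE daily_precip_data AS
  SELECT *
  FROM read_parquet('data/Year=*/*.parquet', hive_partitioning = true)
")
DBI::dbDisconnect(con, shutdown = TRUE)

With hive_partitioning = true, the Year value encoded in each directory name becomes a queryable column, which is what makes the Year filter below effective.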

Results

We choose a date_min and a date_max and use them to query data from the two storage methods described above, using the R package {tictoc} to measure the run-time differences.

Some code relating to connections and credentials has been removed owing to company policy.
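For context, the two connections might be set up roughly as follows. This is a hedged sketch: the bucket path and database file name are placeholders, and the real credential handling is omitted as noted above.

# Local DuckDB connection (single-file database)
con_local <- DBI::dbConnect(duckdb::duckdb(), dbdir = "precip.duckdb", read_only = TRUE)

# S3-backed connection via DuckDB's httpfs extension (placeholder bucket path)
con_db <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con_db, "INSTALL httpfs; LOAD httpfs;")
DBI::dbExecute(con_db, "
  CREATE VIEW daily_precip_data AS
  SELECT *
  FROM read_parquet('s3://my-bucket/daily_precip/Year=*/*.parquet',
                    hive_partitioning = true)
")

Because both connections expose a table or view named daily_precip_data, the same query function can be run against either back end.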

Show the code
get_daily_data <- function(conn, date_min, date_max, years_selected, months_selected){
  tictoc::tic()
  temp <- dplyr::tbl(conn, "daily_precip_data") |>
    # Filter on Year first: the data is partitioned by year,
    # so skipping this filter makes the query very slow.
    dplyr::filter(Year %in% years_selected) |>
    dplyr::filter(Month %in% months_selected) |>
    dplyr::filter(Date >= date_min, Date <= date_max) |>
    dplyr::group_by(GRIDCODE) |>
    dplyr::summarise(
      # Values are stored as integers in tenths of a mm,
      # so divide by 10 to get millimetres.
      precip_sum = sum(Precip_daily) / 10,
      From = min(Date),
      To = max(Date)
    ) |>
    dplyr::collect()
  tictoc::toc()

  return(temp)
}

Query from S3 buckets .parquet files

query_from_db = get_daily_data(con_db, date_min, date_max, years_selected, months_selected)
597.911 sec elapsed
print(str(query_from_db))
tibble [14,129 × 4] (S3: tbl_df/tbl/data.frame)
 $ GRIDCODE  : int [1:14129] 30329 34229 26130 27930 29430 30030 32430 34830 20731 21631 ...
 $ precip_sum: num [1:14129] 2324 1094 791 2064 2971 ...
 $ From      : Date[1:14129], format: "2022-01-01" "2022-01-01" ...
 $ To        : Date[1:14129], format: "2023-12-31" "2023-12-31" ...
NULL

Query from local .duckdb file

query_from_local = get_daily_data(con_local, date_min, date_max, years_selected, months_selected)
0.296 sec elapsed
print(str(query_from_local))
tibble [14,129 × 4] (S3: tbl_df/tbl/data.frame)
 $ GRIDCODE  : int [1:14129] 16989 17889 20889 21189 25989 26289 26589 11590 11890 20290 ...
 $ precip_sum: num [1:14129] 1926 2766 2262 2216 1542 ...
 $ From      : Date[1:14129], format: "2022-01-01" "2022-01-01" ...
 $ To        : Date[1:14129], format: "2023-12-31" "2023-12-31" ...
NULL

All the results shown are real, measured run times.

Note

Notice the run-time improvement with the local-file implementation: roughly 598 seconds versus 0.3 seconds, a speedup of about 2,000×.

A similar method could be used for Phase 2 (database optimization) of the project if duckdb is allowed. Otherwise, a comparable approach could be implemented with PostGIS.
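In the PostGIS/PostgreSQL case, the same query can be issued from R through {DBI} and {RPostgres}, since the dplyr pipeline above translates to SQL for either back end. A rough sketch; the connection parameters and table name are hypothetical placeholders:

# Hypothetical PostGIS-backed connection; host/dbname/credentials are placeholders
con_pg <- DBI::dbConnect(
  RPostgres::Postgres(),
  host   = "localhost",
  dbname = "climate"
)

# The same dbplyr pipeline works unchanged against the Postgres table
result <- dplyr::tbl(con_pg, "daily_precip_data") |>
  dplyr::filter(Date >= date_min, Date <= date_max) |>
  dplyr::group_by(GRIDCODE) |>
  dplyr::summarise(precip_sum = sum(Precip_daily) / 10) |>
  dplyr::collect()

On Postgres, the role played by parquet year-partitioning would instead fall to an index or declarative partitioning on the Date or Year column.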

Map using mapgl

Finally, a map built with {mapgl} to demonstrate experience working with climate data.

Some random noise has been added to the data before plotting to protect data privacy.
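One simple way to add such noise (the noise range here is an illustrative assumption, not the value actually used):

# Add small uniform noise so exact grid values are not recoverable (illustrative)
set.seed(42)
map_data$precip_sum <- map_data$precip_sum +
  runif(nrow(map_data), min = -50, max = 50)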

Show the code
tooltips <- glue("<strong>{map_data$COUNTY_NAME}, {map_data$STATE_NAME}</strong><br><strong>{map_data$precip_sum} mm</strong>")
map_data <- cbind(map_data, tooltips)

# Seven colors (white plus six turbo shades) matched to seven value breakpoints
colors <- c("#fff", viridisLite::turbo(8, direction = -1)[3:8])
min_precip <- min(map_data$precip_sum)
max_precip <- max(map_data$precip_sum)
values <- seq(min_precip, max_precip, length.out = 7) |> round()

map <- maplibre(style = carto_style("positron")) |>
  add_line_layer(
    id = "border",
    source = us_states,
    line_color = "#aaa",
    line_cap = "round"
  ) |>
  fit_bounds(us_states, animate = FALSE, padding = 60) |>
  add_fill_layer(
    id     = "precipitation_data",
    source = map_data,              # grid polygons joined with precip_sum values
    fill_color = interpolate(
      column   = "precip_sum",      # column that drives the fill colors
      values   = values,            # precip_sum breakpoints
      stops    = colors,            # one color per breakpoint
      na_color = "transparent"      # make NA cells transparent
    ),
    tooltip = "tooltips",
    hover_options = list(
      fill_color   = "black",
      fill_opacity = 1
    ),
    fill_opacity = 0.5              # opacity for non-hovered cells
  ) |>
  add_legend(
    "Cumulative rainfall (mm)",
    values = values,
    colors = colors,
    position = "bottom-right",
    width = "325px"
  ) |>
  add_symbol_layer(
    id     = "state_labels",
    source = us_states,
    text_field = "{NAME}",          # field holding the state name
    text_size  = 13,
    text_color = "black",
    text_halo_width = 1,            # white halo improves label readability
    text_halo_color = "white"
  )
      
map